Abstract

In today's digital society, electronic information is increasingly shared among different entities, and decisions
are made based on common attributes. To address associated privacy concerns, the research community has begun
to develop cryptographic techniques for controlled (privacy-preserving) information sharing. One interesting
open problem involves two mutually distrustful parties that need to assess the similarity of their information sets,
but cannot disclose their actual content. This paper presents the first efficient and provably-secure construction for
privacy-preserving evaluation of sample set similarity, measured as the Jaccard similarity index. We present two
protocols: the first securely computes the Jaccard index of two sets, the second approximates it, using MinHash
techniques, with lower costs. We show that our novel protocols are attractive in many compelling applications, including
document similarity, biometric authentication, genetic tests, and multimedia file similarity. Finally, we demonstrate that
our constructions are appreciably more efficient than prior work.



1 Introduction

The availability of electronic information is increasingly essential to the functioning of our communities and, 
in numerous circumstances, data needs to be shared between parties without complete mutual trust. This naturally 
raises commensurate privacy concerns with respect to the disclosure, and long-term safety, of sensitive contents. One 
interesting problem occurs whenever two or more entities need to evaluate the similarity of their information, but are 
reluctant to disclose data wholesale. This task faces three main technical challenges: (1) how to identify a meaningful 
metric to estimate similarity, (2) how to compute a measure of it such that no private information is revealed during 
the process, and (3) how to do so efficiently. We denote this problem as Privacy-preserving Evaluation of Sample Set 
Similarity and we motivate it below, vis-a-vis a few relevant applications. 



Document similarity: Two parties need to estimate the similarity of their documents (or collections thereof) — in 
many settings, documents contain sensitive information and parties may be unwilling, or simply forbidden, to reveal 
their content. For instance, program chairs of a conference may want to verify that none of the submitted papers is also
under review in other conferences or journals, but, naturally, they are not allowed to disclose papers in submission.
Likewise, two law enforcement authorities (e.g., the FBI and the local police), or two investigation teams with different
clearance levels, might need to share documents pertaining to suspected terrorists, but they can do so only conditioned upon
a clear indication that content is relevant to the same investigation.

Iris Matching: Biometric identification and authentication are increasingly used due to fast and inexpensive devices 
that can extract biometric information from a multitude of sources, e.g., voice, fingerprints, iris, and so on. Clearly, 
given its utmost sensitivity, biometric data must be protected from arbitrary disclosure. Consider, for instance, an 
agency that needs to determine whether a given biometric appears on a government watch-list. As agencies may have 
different clearance levels, privacy of the biometric's owner needs to be preserved if no matches are found, but, at the same
time, unrestricted access to the watch-list cannot be granted. 

Genetic Paternity/Ancestry Testing: Advances in DNA sequencing technologies will soon yield ubiquitous and 
low-cost full sequencing of human genomes [15]. This will stimulate the deployment of algorithms that perform, in
computation, various genomic tests, such as genetic paternity and ancestry testing. Since individuals tied by parent-
child or ancestry relationships carry almost identical genomes, an attractive technique to establish such ties is to 
measure the genomes' similarity. However, it is well-known that (human) genomic information is extremely sensitive, 
as a genome not only uniquely identifies an individual, but it also reveals information about, e.g., ethnic heritage, 
disease predispositions, and many other phenotypic traits [20]. 

Multimedia File Similarity: Digital media, e.g., images, audio, video, are increasingly relevant in today's computing 
ecosystems. Consider two parties that wish to evaluate similarity of their media files, e.g., for plagiarism detection: 
sensitivity of possibly unreleased material (or copyright issues) may prevent parties from revealing actual contents. 

All examples above exhibit some common features: neither party can reveal its information in its entirety. What 
they are willing to reveal is limited to a metric that assesses their similarity. This paper presents the design of efficient 
cryptographic techniques for privacy-preserving evaluation of sample set similarity. Such techniques appeal not only
to the examples above, but to any setting where parties need to evaluate similarity of sets, independently of their
nature. They are relevant to a wide spectrum of applications, for instance, in the context of privacy-preserving sharing of
information and/or recommender systems, e.g., to privately assess similarity of social network profiles, social interests, 
network traces, attack logs, healthcare information, and so on. 

1.1 Technical Roadmap & Contributions 

Our first step is to identify a metric for effectively evaluating similarity of sample sets. While several measures
have been proposed, such as cosine similarity (for vectors), Hamming and Levenshtein distances (for strings),
Euclidean, Manhattan, or Minkowski similarity (for geometric objects), and Pearson coefficient (for statistical data), we
focus on a well-known, widely used measure: the Jaccard Similarity Index [24]. It quantifies the similarity between
two sets A and B, and is a rational number between 0 and 1. Experimental and analytical results indicate that high
values of the Jaccard index capture well the informal notion of "roughly the same" [9] and can be used, e.g., to find near
duplicate records [54] and similar documents [9], for web-page clustering [50], data mining [51], and genetics [17, 43].
As sample sets can be relatively large, in distributed settings, an approximation of the index is oftentimes preferred
to its exact calculation. To this end, MinHash techniques [9] are used to estimate the Jaccard index, with remarkably
lower computation and communication costs (see Section 2.1).

In this paper, we define and instantiate a cryptographic primitive geared for privacy-preserving evaluation of
sample set similarity. We design two efficient protocols, allowing two interacting parties to compute (resp., approximate)
the Jaccard index of their private sets, without reciprocally disclosing any information about their content (or,
at most, an upper bound on their size). Our main cryptographic building block is Private Set Intersection Cardinality
(PSI-CA) [21], introduced in Section 2.2. Specifically, we use PSI-CA to privately compute the magnitude of set
intersection and union, and derive the value of the Jaccard index. As fast (linear-complexity) PSI-CA protocols become
available, this can be done efficiently, even on large sets. Nonetheless, our work shows that, using MinHash
approximations, one can obtain an estimate of the Jaccard index with remarkably increased efficiency, i.e., by reducing
the size of input sets and, thus, the number of (costly) cryptographic operations.

Privacy-preserving evaluation of sample set similarity is appealing in many real-world scenarios. We focus on
document similarity, iris matching, and multimedia file similarity, and show that privacy is attainable with low overhead.
Our experiments demonstrate that our generic technique, while not bound to any specific application, is
appreciably more efficient than state-of-the-art protocols that only focus on one specific scenario, while maintaining
comparable accuracy. Finally, in the process of reviewing related work, we identify limits and flaws of some prior
results.

Organization. The rest of this paper is structured as follows. The next section introduces building blocks, then Section 3
presents our construction for secure computation of the Jaccard index and an even more efficient technique to (privately)
approximate it. Then, Sections 4, 5, and 6 present our constructions for privacy-preserving similarity evaluation of,
respectively, documents, irises, and multimedia content. Finally, Section 7 sketches a very fast protocol to privately
approximate set intersection cardinality that additionally hides set sizes, and the paper concludes in Section 8.

2 Preliminaries



This section provides some relevant background information on Jaccard index, MinHash techniques, and our main 
cryptographic building blocks. 

2.1 Jaccard Similarity Index and MinHash Techniques 

Jaccard Index. One of the most common metrics for assessing the similarity of two sets (hence, of data they represent)
is the Jaccard index [24]. It measures the similarity between two sets $A$ and $B$ as $J(A, B) = |A \cap B| / |A \cup B|$. High
values of the index suggest that two sets are very similar, whereas low values indicate that $A$ and $B$ are almost disjoint.
The Jaccard index of $A$ and $B$ can be rewritten as a mere function of the intersection:
$J(A, B) = |A \cap B| / (|A| + |B| - |A \cap B|)$.
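As a small illustration of the two equivalent forms above, the following Python sketch computes the index both from the sets themselves and from the three cardinalities $|A|$, $|B|$, $|A \cap B|$ alone. The function names are hypothetical, and treating two empty sets as identical (index 1) is a convention assumed here, not one stated in the text.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    # Assumption: two empty sets are considered identical.
    return len(a & b) / union if union else 1.0


def jaccard_from_intersection(size_a: int, size_b: int, size_inter: int) -> float:
    """Same index, computed only from |A|, |B| and |A ∩ B|."""
    union = size_a + size_b - size_inter
    return size_inter / union if union else 1.0


A = {"the", "quick", "brown", "fox"}
B = {"the", "lazy", "brown", "dog"}

# Both calls use |A ∩ B| = 2 and |A ∪ B| = 6.
print(jaccard(A, B))
print(jaccard_from_intersection(len(A), len(B), len(A & B)))
```

The second form matters later in the paper: a party that knows its own set size, the other party's set size, and the intersection cardinality can derive the index without seeing the other set.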

MinHash Techniques. Clearly, computing the Jaccard index incurs a complexity linear in set sizes. Thus, in the
context of a large number of big sets, its computation might be relatively expensive. In fact, for each pair of sets,
the Jaccard index must be computed from scratch, i.e., no information used to calculate $J(A, B)$ can be re-used for
$J(A, C)$. (That is, similarity is not a transitive measure.) As a result, an approximation of the Jaccard index is often
preferred, as it can be obtained at a significantly lower cost, e.g., using MinHash techniques [9]. Informally, MinHash
techniques extract a small representation $h_k(S)$ of a set $S$ through deterministic (salted) sampling. This representation
has a constant size $k$, independent from $|S|$, and can be used to compute an approximation of the Jaccard index. The
parameter $k$ also defines the expected error with respect to the exact Jaccard index. Intuitively, larger values of $k$
yield smaller approximation errors. The computation of $h_k(S)$ also incurs a complexity linear in set sizes; however,
it must be performed only once per set, for any number of comparisons. Thus, with MinHash techniques, evaluating
the similarity of any two sets requires only a constant number of comparisons. Similarly, the bandwidth used by two
interacting parties to approximate the Jaccard index of their respective sets is also constant ($O(k)$).

There are two strategies to realize MinHashes: one employs multiple hash functions, while the other is built from
a single hash function.¹

MinHash with many hash functions. Let $\mathcal{F}$ be a family of hash functions that map items from set $U$ to distinct $\tau$-bit
integers. Select $k$ different functions $h^{(1)}(\cdot), \ldots, h^{(k)}(\cdot)$ from $\mathcal{F}$; for any set $S \subseteq U$, let $h^{(i)}_{\min}(S)$ be the item $s \in S$
with the smallest value $h^{(i)}(s)$. The MinHash representation $h_k(S)$ of set $S$ is a vector $h_k(S) = \{h^{(i)}_{\min}(S)\}_{i=1}^{k}$. The
Jaccard index $J(A, B)$ is estimated by counting the number of indexes $i$ such that $h^{(i)}_{\min}(A) = h^{(i)}_{\min}(B)$, and
this value is then divided by $k$. Observe that it holds that $h^{(i)}_{\min}(A) = h^{(i)}_{\min}(B)$ exactly when the item with the minimum hash value
in $A \cup B$ lies in $A \cap B$.

This measure can be obtained by computing the cardinality of the intersection of $h_k(A)$ and $h_k(B)$, in the following
way. Each element $a_i$ of the vector $h_k(A)$ is encoded as $(a_i, i)$. Similarly, $(b_i, i)$ represents the $i$-th element of
vector $h_k(B)$. An unbiased estimate of the Jaccard index between $A$ and $B$ is given by:

$$sim(A, B) = \frac{\left|\{(a_i, i)\}_{i=1}^{k} \cap \{(b_i, i)\}_{i=1}^{k}\right|}{k} \qquad (1)$$

As discussed in [10], if $\mathcal{F}$ is a family of min-wise independent hash functions, then each value of a fixed set $A$ has
the same probability to be the element with the smallest hash value. Specifically, for each min-wise independent hash
function $h^{(i)}(\cdot)$ and for any set $S$, we have that, for any $s_j, s_l \in S$, $\Pr[s_j = h^{(i)}_{\min}(S)] = \Pr[s_l = h^{(i)}_{\min}(S)]$. Thus,
we also obtain that $\Pr[h^{(i)}_{\min}(A) = h^{(i)}_{\min}(B)] = J(A, B)$. In other words, if $r$ is a random variable that is 1 when
$h^{(i)}_{\min}(A) = h^{(i)}_{\min}(B)$ and 0 otherwise, then $r$ is an unbiased estimator of $J(A, B)$; however, in order to reduce its
variance, such a random variable must be sampled several times, i.e., $k \gg 1$ hash values must be used. In particular, by
Chernoff bounds [12], the expected error of this estimate is $O(1/\sqrt{k})$.
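The multi-hash MinHash construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the $k$ hash functions are simulated by salting SHA-256 with the index $i$ (a stand-in chosen here, not a family prescribed by the paper), and the names `salted_hash`, `minhash`, and `estimate_jaccard` are hypothetical.

```python
import hashlib


def salted_hash(item: str, salt: int) -> int:
    """Stand-in for the i-th hash function h^(i): SHA-256 over salt and item."""
    data = f"{salt}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def minhash(items: set, k: int) -> list:
    """Return [h_min^(1)(S), ..., h_min^(k)(S)]: for each hash function,
    the item of S with the smallest hash value."""
    return [min(items, key=lambda s: salted_hash(s, i)) for i in range(k)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of indexes i on which both representations hold the same item."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Two sets with |A ∩ B| = 100 and |A ∪ B| = 200, i.e. an exact index of 0.5.
A = {f"item{n}" for n in range(0, 150)}
B = {f"item{n}" for n in range(50, 200)}
print(estimate_jaccard(minhash(A, 400), minhash(B, 400)))
```

Each representation is computed once per set and has length $k$ regardless of the set size, which is what makes the later comparison step constant-size.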

Approximating (Jaccard) Similarity of Vectors without MinHash. If one needs to approximate the Jaccard index 
of two fixed-length vectors (rather than sets), one could use other (more efficient) techniques similar to MinHash. 

¹ This paper focuses on the former technique, thus, we defer the description of the latter to Appendix A, which also overviews possible MinHash
instantiations.

Client, on input $C = \{c_1, \ldots, c_v\}$; Server, on input $S = \{s_1, \ldots, s_w\}$

Client: $R_c \leftarrow \mathbb{Z}_q$; $\forall i$, $1 \le i \le v$: $hc_i = H(c_i)$; $a_i = (hc_i)^{R_c}$

Server: $(\hat{s}_1, \ldots, \hat{s}_w) \leftarrow \Pi(S)$; $\forall j$, $1 \le j \le w$: $hs_j = H(\hat{s}_j)$

Client to Server: $\{a_1, \ldots, a_v\}$

Server: $R_s \leftarrow \mathbb{Z}_q$; $\forall i$, $1 \le i \le v$: $a'_i = (a_i)^{R_s}$; $(a'_{\ell_1}, \ldots, a'_{\ell_v}) = \Pi'(a'_1, \ldots, a'_v)$

Server: $\forall j$, $1 \le j \le w$: $bs_j = (hs_j)^{R_s}$; $ts_j = H'(bs_j)$

Server to Client: $\{a'_{\ell_1}, \ldots, a'_{\ell_v}\}$ and $\{ts_1, \ldots, ts_w\}$

Client: $\forall i$, $1 \le i \le v$: $bc_{\ell_i} = (a'_{\ell_i})^{1/R_c \bmod q}$; $tc_{\ell_i} = H'(bc_{\ell_i})$

Output: $|\{ts_1, \ldots, ts_w\} \cap \{tc_{\ell_1}, \ldots, tc_{\ell_v}\}|$

Figure 1: PSI-CA protocol from [16], executed on common input of two primes $p$ and $q$ (such that $q \mid p - 1$), a generator $g$ of a
subgroup of size $q$, and two hash functions, $H$ and $H'$, modeled as random oracles. $\Pi(\cdot)$ and $\Pi'(\cdot)$ denote random permutations. All
computation is mod $p$.

Observe that a vector $\vec{S}$ can be represented as a set $S = \{(s_i, i)\}$, where $s_i$ is simply the $i$-th element of $\vec{S}$. We now
discuss an efficient strategy to approximate the Jaccard index of two vectors $A = \{(a_i, i)\}_{i=1}^{n}$ and $B = \{(b_i, i)\}_{i=1}^{n}$ of
length $n$, without using MinHash. The approach discussed here incurs constant ($O(k)$) computational and communication
complexity, i.e., it is independent of the length of the vectors being compared. First, select $k$ random
values $(r_1, \ldots, r_k)$, with each $r_i$ uniformly distributed in $[1, n]$, and compute $A_k = \{(a_{r_i}, r_i)\}_{i=1}^{k}$ and $B_k = \{(b_{r_i}, r_i)\}_{i=1}^{k}$,
respectively. The value $\delta = |A_k \cap B_k|/k$ can then be used to assess the similarity of $A$ and $B$. We argue that $\delta$ is an
unbiased estimate of $J(A, B)$: for each $a \in (A_k \cup B_k)$ we have that $\Pr[a \in (A_k \cap B_k)] = \Pr[a \in (A \cap B)]$ since
$a \in (A \cap B) \wedge a \in (A_k \cup B_k) \Leftrightarrow a \in (A_k \cap B_k)$. We also have $\Pr[a \in (A \cap B)] = J(A, B)$, thus, $\delta$ is indeed an
unbiased estimate of $J(A, B)$.

The above algorithm implements a perfect min-wise permutation for this setting: since elements $(r_1, \ldots, r_k)$ are
uniformly distributed, for each $i \in [1, k]$ any element in $A$ and $B$ has the same probability of being selected. As such,
similar to MinHash with many hash functions, the expected error is also $O(1/\sqrt{k})$.
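The sampling step can be sketched as follows. The function below computes the value $\delta = |A_k \cap B_k|/k$ from the text, i.e., the fraction of sampled positions on which the two vectors agree. Two choices are assumptions of this sketch rather than part of the description above: positions are 0-based, and they are drawn without replacement (so $k \le n$ is required) so that the $k$ encoded pairs are distinct. The name `sampled_similarity` is hypothetical.

```python
import random


def sampled_similarity(vec_a, vec_b, k, rng):
    """Compute delta = |A_k ∩ B_k| / k on k shared random positions."""
    n = len(vec_a)
    assert len(vec_b) == n and k <= n
    # Assumption: sample distinct positions, so both encoded sets have size k.
    positions = rng.sample(range(n), k)
    a_k = {(vec_a[r], r) for r in positions}  # A_k = {(a_r, r)}
    b_k = {(vec_b[r], r) for r in positions}  # B_k = {(b_r, r)}
    return len(a_k & b_k) / k


rng = random.Random(2012)  # both parties must use the same positions
u = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
v = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(sampled_similarity(u, v, 5, rng))
```

In a two-party run, both parties would need to agree on the sampled positions beforehand; the cost of the comparison then depends only on $k$, not on $n$.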

2.2 Cryptography Background 

Private Set Intersection Cardinality (PSI-CA) is a cryptographic protocol involving two parties: Alice, on input
$A = \{a_1, \ldots, a_w\}$, and Bob, on input $B = \{b_1, \ldots, b_v\}$, such that Alice outputs $|A \cap B|$, while Bob has no output.

In the last few years, several PSI-CA protocols [21, 53, 31, 16] have been proposed that are secure in different
security models and under different assumptions. We choose the protocol in [16] as it achieves communication
and computation complexity linear in set sizes. As a result, throughout this paper, we use the PSI-CA construction
from [16], which is secure, in the presence of semi-honest adversaries, under the DDH assumption in the Random Oracle Model
(ROM). It requires $O(|A| + |B|)$ offline and $O(|A|)$ online modular exponentiations in $\mathbb{Z}_p$ with exponents from
subgroup $\mathbb{Z}_q$. (Offline operations are computed only once, for any number of interactions and any number of interacting
parties.) Communication overhead amounts to $O(|A|)$ elements in $\mathbb{Z}_p$ and $O(|B|)$ elements in $\mathbb{Z}_q$. (Assuming an 80-bit security
parameter, $|q| = 160$ bits and $|p| = 1024$ bits.) The Private Set Intersection Cardinality (PSI-CA) protocol from [16]
is outlined in Figure 1.

Protocol correctness is easily verifiable: for any $c_i$ held by the client and $s_j$ held by the server, if $c_i = s_j$, hence, $hc_i = hs_j$,
we obtain:

$$tc_{\ell_i} = H'(bc_{\ell_i}) = H'\big((a'_{\ell_i})^{1/R_c}\big) = H'\big((hc_i)^{R_s}\big) = H'\big((hs_j)^{R_s}\big) = H'(bs_j) = ts_j$$

Hence, the client learns the set intersection cardinality by counting the number of matching pairs $(tc_{\ell_i}, ts_j)$.
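To make the message flow and the correctness argument concrete, here is a toy Python run of the blinded-exponentiation steps, with both roles executed in one process. It is a sketch under several assumptions that are not taken from [16]: the group is a small safe-prime group found at run time (far too small to be secure), $H$ maps into the order-$q$ subgroup by squaring a SHA-256 digest modulo $p$, $H'$ is SHA-256 over the decimal encoding, and all function names are hypothetical. It requires Python 3.8 or later for modular inverses via `pow`.

```python
import hashlib
import secrets


def is_prime(n: int) -> bool:
    """Deterministic Miller-Rabin; these bases suffice for n below 2**64."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for b in bases:
        if n % b == 0:
            return n == b
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for b in bases:
        x = pow(b, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True


def find_safe_prime(start: int):
    """Return (p, q) with p = 2q + 1 and both prime (toy parameters only)."""
    q = start | 1
    while not (is_prime(q) and is_prime(2 * q + 1)):
        q += 2
    return 2 * q + 1, q


P, Q = find_safe_prime(2 ** 48)  # q divides p - 1, as required in Figure 1


def H(item: str) -> int:
    """Assumed hash into the order-q subgroup: square a digest modulo p."""
    digest = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest(), "big")
    return pow(digest % P, 2, P)


def H_prime(x: int) -> str:
    """Assumed second hash H'."""
    return hashlib.sha256(str(x).encode("utf-8")).hexdigest()


def psi_ca(client_set: set, server_set: set) -> int:
    # Client: blind each hashed item with R_c.
    r_c = secrets.randbelow(Q - 1) + 1
    a = [pow(H(c), r_c, P) for c in client_set]
    # Server: raise to R_s, permute, and tag its own items.
    r_s = secrets.randbelow(Q - 1) + 1
    a_prime = [pow(x, r_s, P) for x in a]
    secrets.SystemRandom().shuffle(a_prime)
    ts = {H_prime(pow(H(s), r_s, P)) for s in server_set}
    # Client: remove R_c (exponent 1/R_c mod q) and count matching tags.
    inv = pow(r_c, -1, Q)
    tc = {H_prime(pow(x, inv, P)) for x in a_prime}
    return len(ts & tc)


print(psi_ca({"alice", "bob", "carol"}, {"bob", "carol", "dave", "erin"}))
```

Because every hashed item lies in the subgroup of order $q$, raising to $R_c$ and then to $1/R_c \bmod q$ cancels out, leaving $(hc_i)^{R_s}$, which is exactly the equality chain above.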
Adversarial Model: We use standard security models for secure two-party computation, which assume adversaries
to be either semi-honest or malicious.² As per definitions in [22], protocols secure in the presence of semi-honest
adversaries assume that parties faithfully follow all protocol specifications and do not misrepresent any information
related to their inputs, e.g., size and content. However, during or after protocol execution, any party might (passively)
attempt to infer additional information about the other party's input. In contrast, security in the presence of malicious parties
allows arbitrary deviations from the protocol. Security arguments in this paper are made with respect to semi-honest
participants.

3 Privacy-preserving Sample Set Similarity 
3.1 Private Computation of Jaccard Index 

We now present our first construction for privacy-preserving computation of the Jaccard index. We consider two
parties, Alice and Bob, on input sets $A$ and $B$, respectively. We show that parties can efficiently compute $J(A, B)$ in
a privacy-preserving manner using the protocol illustrated in Figure 2 below.



Privacy-preserving computation of $J(A, B)$

(Run by Alice and Bob, on input, resp., $A$, $B$)

1. Alice and Bob execute PSI-CA on input, resp., $(A, |B|)$ and $(B, |A|)$

2. Alice learns $c = |A \cap B|$

3. Alice computes $u = |A \cup B| = |A| + |B| - c$

4. Alice outputs $J(A, B) = c/u$

Figure 2: Proposed protocol for privacy-preserving computation of set similarity. 

Complexity. Clearly, the cost of the protocol in Figure 2 is dominated by that incurred by the underlying PSI-CA protocol.
As we select the PSI-CA construction of [16], which incurs linear communication and computational complexities,
the overall complexities of the protocol in Figure 2 are also linear in the size of sets. If we were to compute the Jaccard index
without privacy, asymptotic complexities would be the same as for our privacy-preserving protocol, i.e., linear. However,
given the lack of cryptographic operations, constants hidden by the big $O(\cdot)$ notation would be much smaller.

Security. Our main security claim is that, by running the protocol in Figure 2, parties do not reciprocally disclose the
content of their private sets, but only learn similarity computed as the Jaccard index and the size of the other party's
input (hence, the cardinality of set intersection and union). Therefore, the security of the protocol in Figure 2 relies on that of
the underlying PSI-CA instantiation. In particular, Alice and Bob do not exchange any information besides messages
related to the PSI-CA protocol. For this reason, a secure implementation of the underlying PSI-CA guarantees that
neither Alice nor Bob learns additional information about the other party's set (thus, we omit detailed formal proofs to
ease presentation).

Performance Evaluation. Our technique for secure Jaccard index computation can be instantiated using any PSI-CA
construction. Nonetheless, to maximize efficiency, we choose the one in [16]. To assess the practicality of the
resulting construction, the protocol in Figure 2 has been implemented in C (with OpenSSL and GMP libraries), using
160-bit random exponents and 1024-bit moduli to obtain 80-bit security. We assume $|A| = |B| = 1000$ and set items
to be hashed using SHA-1, thus, they are 160-bit. In this setting, the protocol in Figure 2 would incur (i) about 0.5s
total computation time on a single Intel Xeon E5420 core running at 2.50GHz and (ii) 276KB in bandwidth. We
omit running times for larger sets since, as complexities are linear, one can easily derive a meaningful estimate of
time/bandwidth for any size.
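The extrapolation mentioned above amounts to simple proportional scaling. The sketch below assumes costs scale exactly linearly from the reported data point (1000 items per set, 0.5s, 276KB) with no fixed overhead, which is a simplification; the function name `estimate_cost` is hypothetical.

```python
BASE_SIZE = 1000        # |A| = |B| = 1000 in the reported experiment
BASE_SECONDS = 0.5      # total computation time reported for that size
BASE_KILOBYTES = 276    # bandwidth reported for that size


def estimate_cost(set_size: int):
    """Scale the reported figures linearly to another set size.

    Returns (seconds, kilobytes). Assumes purely proportional costs.
    """
    scale = set_size / BASE_SIZE
    return BASE_SECONDS * scale, BASE_KILOBYTES * scale


seconds, kilobytes = estimate_cost(10_000)
print(seconds, kilobytes)
```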

We also implement an optimized prototype that further improves total running time by (1) pipelining computation
and transmission and (2) parallelizing computation on two cores. We test the prototype by running Alice and Bob on
two PCs equipped with 2 quad-core Intel Xeon E5420 processors running at 2.50GHz; however, we always use (at
most) 2 cores. On a conservative stance, we do not allow parties to perform any pre-computation offline. We simulate
a 9Mbps link, since, according to [40], it represents the current average Internet bandwidth in the US and Canada.

² Hereafter, the term adversary refers to protocol participants. Outside adversaries are not considered, since their actions can be mitigated via
standard network security techniques.

In this setting, and again considering $|A| = |B| = 1000$, the total running time of the protocol in Figure 2 amounts to
0.23s, whereas the computation of the Jaccard index without privacy takes 0.018s. Therefore, we conclude that privacy
protection, in our experiments, only introduces a 12-fold slowdown, independently of set sizes.

Comparison to prior work. The performance evaluation above does not include any prior solutions, since, to the best
of our knowledge, there is no comparable cryptographic primitive for privacy-preserving computation of the Jaccard
index. The work in [47] is somewhat related: it targets private computation of the Jaccard index using Private Equality
Testing (PET) [33] and deterministic encryption; however, it introduces the need for a non-colluding semi-honest
third party, which violates our design model. Also, it incurs an impractical number of public-key operations, i.e.,
quadratic in the size of sample sets (as opposed to linear in our case). Finally, additional (only vaguely) related
techniques include: (i) work on privately approximating the dot product of two vectors, such as [44, 29], and (ii)
probabilistic/approximated private set operations based on Bloom filters, such as [29, 30]. (None of these techniques,
however, can be used to solve the problems considered in this paper.)

3.2 Private Estimation of Jaccard Index based on MinHash 

The computation of the Jaccard index, with or without privacy, can be relatively expensive when (1) sample sets 
are very large, or (2) each set must be compared with a large number of other sets. Thus, MinHash techniques, 
introduced in Section 2.1, are often used to estimate the Jaccard index, trading off an expected error with appreciably 
faster computation. We now show how to privately approximate the similarity of two sample sets combining MinHash 
and PSI-CA. Our construction is general and does not assume any specific instantiation, provided that it is a multi-hash
MinHash.

Recall, from Section 2.1, that the Jaccard index can also be approximated as $sim(A, B) = |\{(a_i, i)\}_{i=1}^{k} \cap \{(b_i, i)\}_{i=1}^{k}|/k$,
where $a_i = h^{(i)}_{\min}(A)$ and $b_i = h^{(i)}_{\min}(B)$. Therefore, privacy-preserving estimation of the Jaccard index, using multi-hash
MinHash, can be reduced to securely computing the cardinality of the set intersection above. The resulting protocol is
presented in Figure 3 below.



Private Jaccard index estimation $sim(A, B)$

(Run by Alice and Bob, on input, resp., $A$, $B$)

1. Alice and Bob compute $\{(a_i, i)\}_{i=1}^{k}$ and $\{(b_i, i)\}_{i=1}^{k}$, resp., using multi-hash MinHash

2. Alice and Bob execute PSI-CA on input, resp., $(\{(a_i, i)\}_{i=1}^{k}, k)$ and $(\{(b_i, i)\}_{i=1}^{k}, k)$

3. Alice learns $\delta = |\{(a_i, i)\}_{i=1}^{k} \cap \{(b_i, i)\}_{i=1}^{k}|$

4. Alice outputs $sim(A, B) = \delta/k$

Figure 3: Proposed protocol for privacy-preserving approximation of set similarity.



It is easy to observe that, compared to the Jaccard index computation (Section 3.1), the use of MinHash leads to
executing PSI-CA on smaller sets, as it holds that $k \ll \min(|A|, |B|)$. Thus, communication and computation overhead
depends on $k$, since inputs to PSI-CA are now sets of $k$ items, independently of the size of $A$ and $B$.

Security and Extensions. Similar to the protocol in Section 3.1, the security of the protocol in Figure 3 relies on the
security of the underlying PSI-CA construction.

In this protocol, Alice learns some additional information compared to the protocol in Section 3.1. In particular,
rather than computing the similarity (and therefore the size of the intersection) of sets $A$ and $B$, she determines how
many elements from a particular subset of $A$ (constructed using MinHash) also appear in the subset selected from $B$.

To fix this issue, Alice and Bob can construct their input sets (Step 1 in Figure 3) using a set of OPRFs rather than a
set of hash functions: Alice and Bob engage in a multi-party protocol where Alice inputs her set $A = \{a_1, \ldots, a_v\}$ and
learns a random permutation of $\mathrm{OPRF}_{key_j}(a_1), \ldots, \mathrm{OPRF}_{key_j}(a_v)$ for $1 \le j \le k$. Alice constructs her input by selecting
the smallest value $\mathrm{OPRF}_{key_j}(a_i)$ for each $j$. Bob constructs his input without interacting with Alice. While the cost of
this protocol is linear in the size of the input sets, it is significantly higher than that of the protocol in Section 3.1.

Performance Evaluation. We also tested the performance of our construction for privacy-preserving approximation
of Jaccard similarity, again using the PSI-CA from [16]. We used the same setting as in Section 3.1, i.e., we selected sets
with 1000 items, 1024-bit moduli and 160-bit random exponents, and ran experiments on two PCs with 2.5GHz CPU
and a 9Mbps link. Selecting $k = 400$, thus bounding the error to 5%, the total running time of the protocol in Figure 3
amounts to 0.09s, less than half compared to the one in Figure 2. In the same setting, the approximation
of the Jaccard index without privacy takes 0.007s. Thus, the slow-down factor introduced by the privacy-protecting layer
(similar to the protocol proposed in Section 3.1) is 12-fold. Again, note that times for different set sizes can be easily
estimated since the complexity of the protocol is linear.

Prior Work. The estimation of set similarity through MinHash, whether privacy-preserving or not, requires
counting the number of times for which it holds that $h^{(i)}_{\min}(A) = h^{(i)}_{\min}(B)$, with $i = 1, \ldots, k$. We have denoted this
number as $\delta$. The protocol in Figure 3 above attains secure computation of $\delta$ through privacy-preserving set intersection
cardinality. However, it appears somewhat more intuitive to do so by using the approach proposed by [4] in the
context of social-network friends discovery. Specifically, in [4], Alice and Bob compute, resp., $\{a_i\}_{i=1}^{k}$ and $\{b_i\}_{i=1}^{k}$,
just like in our protocol. Then, Alice generates a public-private keypair $(pk, sk)$ for Paillier's additively homomorphic
encryption cryptosystem [42] and sends Bob $\{z_i = Enc_{pk}(a_i)\}_{i=1}^{k}$. Bob computes $\{(z_i \cdot Enc_{pk}(-b_i))^{r_i}\}_{i=1}^{k}$ for
random $r_i$'s and returns the resulting vector of ciphertexts after shuffling it. Upon decryption, Alice learns $\delta$ by
counting the number of 0's. Nonetheless, the technique proposed by [4] actually incurs an increased complexity,
compared to our protocol in Figure 3 (instantiated with PSI-CA from [16]). Assuming 80-bit security parameters,
thus, 1024-bit moduli and 160-bit subgroups, and 2048-bit Paillier moduli, and using $m$ to denote a multiplication
of 1024-bit numbers, multiplications of 2048-bit numbers count for $4m$. Using square-and-multiply, exponentiations
with $|q|$-bit exponents modulo 1024-bit numbers count for $(1.5|q|)m$. In [4], Alice performs $k$ Paillier encryptions (i.e., $2k$
exponentiations and $k$ multiplications) and $k$ decryptions (i.e., $k$ exponentiations and multiplications), while Bob
computes $k$ exponentiations and multiplications. Therefore, the total computation complexity amounts to
$(4 \cdot 4 \cdot 1.5 \cdot 1024 + 3 \cdot 4)km = 24{,}588\,km$. In contrast, our approach (even without pre-computation) requires both Alice and Bob to
perform $4k$ exponentiations with 160-bit exponents modulo 1024-bit moduli and $2k$ multiplications, i.e., $(4 \cdot 1.5 \cdot 160 + 2)km = 962\,km$;
thus, our protocol achieves a 25-fold efficiency improvement. Communication overhead is also higher in [4]:
it amounts to $(2 \cdot 2048)k$ bits, whereas, using PSI-CA, we need to transfer $(2 \cdot 1024 + 160)k$ bits, i.e., slightly more
than half the traffic.
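The arithmetic behind the comparison above can be re-checked with a few lines of Python. All costs are per MinHash item (the common factor $k$ is dropped) and expressed in units of $m$. Reading the factor 1024 in the first formula as the exponent length of the Paillier exponentiations is an interpretation made for this sketch; the variable names are hypothetical.

```python
MULT_2048 = 4  # one multiplication of 2048-bit numbers counts as 4m


def exp_cost_1024(exponent_bits: int) -> float:
    """Square-and-multiply exponentiation modulo a 1024-bit number: (1.5|q|)m."""
    return 1.5 * exponent_bits


# Approach of [4]: 4 exponentiations and 3 multiplications on 2048-bit numbers.
paillier_per_item = 4 * MULT_2048 * exp_cost_1024(1024) + 3 * MULT_2048

# Figure 3 with PSI-CA from [16]: 4 exponentiations with 160-bit exponents
# and 2 multiplications, all modulo 1024-bit numbers.
psi_ca_per_item = 4 * exp_cost_1024(160) + 2

speedup = paillier_per_item / psi_ca_per_item

# Communication per item, in bits.
paillier_bits_per_item = 2 * 2048
psi_ca_bits_per_item = 2 * 1024 + 160
traffic_ratio = psi_ca_bits_per_item / paillier_bits_per_item

print(paillier_per_item, psi_ca_per_item, speedup, traffic_ratio)
```

The computed ratio is slightly above 25, matching the 25-fold figure quoted in the text, and the traffic ratio is slightly above one half.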

4 Privacy-Preserving Document Similarity 

After building efficient (linear-complexity) primitives for privacy-preserving computation/approximation of the Jaccard
index, we now explore their applications to a few compelling problems. We start with evaluating the similarity
of two documents, which is relevant in many common applications, including copyright protection, file management,
plagiarism prevention, duplicate submission detection, and law enforcement. In the last few years, the security community
has started investigating privacy-preserving techniques to enable detection of similar documents without disclosing
the documents' actual contents. Below, we first review prior work and then present our technique for efficient
privacy-preserving document similarity.

4.1 Related Work 

The work in [27] (later extended in [38]) is the first to realize privacy-preserving document similarity. It realizes
secure computation of the cosine similarity of vectors representing the documents, i.e., each document is represented
as the list of words appearing in it, along with the normalized number of occurrences. Recently, Jiang and Samanthula
[28] have proposed a novel technique relying on the Jaccard index and N-gram based document representation
[37]. (Given any string, an N-gram is a substring of size N.) According to [28], the N-gram based technique
presents several advantages over cosine similarity: (1) it improves on finding local similarity, e.g., overlapping of
pieces of texts, (2) it is language-independent, (3) it requires a much simpler representation, and (4) it is less sensitive
to document modification. We overview it below.

Documents as sets of N-grams. A document can be represented as a set of N-grams contained in it. To obtain such a
representation, one needs to remove spaces and punctuation and build the set of successive N-grams in the document.
An example of a sentence, along with its N-gram representation (for N = 3), is illustrated in Figure 4. The similarity
of two documents can then be estimated as the Jaccard index of the two corresponding sets of N-grams. In the context
of document similarity, experts point out that N = 3 is a good choice [9].



the quick brown fox jumps over the lazy dog 

↓

{azy, bro, ckb, dog, ela, equ, ert, fox, hel, heq, ick, jum, kbr, laz, mps, nfo, 
ove, own, oxj, pso, qui, row, rth, sov, the, uic, ump, ver, wnf, xju, ydo, zyd} 

Figure 4: Tri-gram representation. 
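The representation in Figure 4 can be reproduced with a short Python function. This is a minimal sketch: the name `n_grams` is hypothetical, and the normalization (lower-casing, keeping only letters and digits) is an assumption that matches the figure but may differ from the preprocessing used in [28] or [37].

```python
import re


def n_grams(text: str, n: int = 3) -> set:
    """Return the set of successive n-grams of a document.

    Assumption: the text is lower-cased and everything except
    letters and digits (spaces, punctuation) is removed first.
    """
    stripped = re.sub(r"[^a-z0-9]", "", text.lower())
    return {stripped[i:i + n] for i in range(len(stripped) - n + 1)}


sentence = "the quick brown fox jumps over the lazy dog"
grams = n_grams(sentence)
print(sorted(grams))
print(len(grams))
```

The sentence has 35 letters and therefore 33 successive tri-grams; since "the" occurs twice, the resulting set has 32 elements, as listed in Figure 4.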

To enable privacy-preserving computation of the Jaccard index, and therefore estimation of document similarity, Jiang 
and Samanthula [28] propose a two-stage protocol based on Paillier's additively homomorphic encryption [42]. 
Suppose Alice wants to privately evaluate the similarity of her document $D_A$ against a list of n documents held by Bob, i.e., 
$D_{B:1}, \ldots, D_{B:n}$. First, Bob generates a global space, $S$, of tri-grams based on his document collection. This way, 
$D_A$ as well as each of Bob's documents, $D_{B:i}$, can be represented as binary vectors in the global space of tri-grams: 
each component is 1 if the corresponding tri-gram is included in the document and 0 otherwise. We denote with $A$ the 
representation of $D_A$ and with $B_i$ that of $D_{B:i}$. Then, Alice and Bob respectively compute random shares $a$ and $b_i$ such 
that $a + b_i = |A \cap B_i|$. Next, they set $c = |A| - a$ and $d_i = |B_i| - b_i$. Finally, Alice and Bob, on input $(a, c)$ and $(b_i, d_i)$, 
resp., execute a Secure Division protocol (e.g., [11, 1]) to obtain $(a + b_i)/(c + d_i) = |A \cap B_i| / |A \cup B_i| = J(A, B_i)$. 
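
The correctness of the final division follows from the definition of the shares alone. Spelling out the arithmetic, with $a, b_i$ the shares of the intersection cardinality and $c, d_i$ the derived values:

```latex
\begin{align*}
a + b_i &= |A \cap B_i|, \\
c + d_i &= (|A| - a) + (|B_i| - b_i) = |A| + |B_i| - |A \cap B_i| = |A \cup B_i|, \\
\frac{a + b_i}{c + d_i} &= \frac{|A \cap B_i|}{|A \cup B_i|} = J(A, B_i).
\end{align*}
```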

The computational complexity of the protocol in [28] amounts to $O(|S|)$ Paillier encryptions performed by Alice, 
and $O(n \cdot |S|)$ modular multiplications performed by Bob, whereas communication overhead amounts to $O(n \cdot |S|)$ Paillier 
ciphertexts. 

Flaw in [28]. Unfortunately, the protocol in [28] is not secure, since Bob has to disclose his global space of tri-grams 
(i.e., the set of all tri-grams appearing in his document collection). Therefore, Alice can passively check whether or 
not a word appears in Bob's document collection. Actually, Alice can learn much more, as we show in Appendix B. 
We argue that this flaw could be fixed by considering the global space of tri-grams as the set of all possible tri-grams, 
thus avoiding the disclosure of Bob's tri-gram set. Assuming that documents are stripped of any symbol and contain 
only lower-cased letters and digits, we obtain $S = \{a, b, \ldots, z, 0, 1, \ldots, 9\}^3$. Unfortunately, this modification would 
tremendously increase computation and communication overhead. 

4.2 Our Construction 

As discussed in Section 3, we can realize privacy-preserving computation of the Jaccard index using PSI-CA. To 
privately evaluate the similarity of document $D_A$ and any document $D_{B:i}$, Alice and Bob execute the protocol in Figure 5. 
Function Tri-Gram(·) denotes the representation of a document as the set of tri-grams appearing in it. 



Alice ($D_A$)                                              Bob ($D_{B:i}$) 

$A \leftarrow$ Tri-Gram($D_A$)                                      $B_i \leftarrow$ Tri-Gram($D_{B:i}$) 

                    $|A \cap B_i| \leftarrow$ PSI-CA($A$, $B_i$) 

Output Similarity as $J(A, B_i) = \dfrac{|A \cap B_i|}{|A| + |B_i| - |A \cap B_i|}$ 

Figure 5: Privacy-preserving evaluation of document similarity of documents $D_A$ and $D_{B:i}$. 
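
The flow of Figure 5 in the one-against-many setting can be sketched as follows. The sketch is not private: `psi_ca_placeholder` is a hypothetical stand-in that simply intersects the sets in the clear, where a deployment would run an actual PSI-CA protocol such as the one in [16]. We also assume that the party computing the output knows both set sizes.

```python
def psi_ca_placeholder(a, b):
    """Stand-in for PSI-CA: returns only |A ∩ B|.

    NOT private. A real run would use a PSI-CA protocol so that neither
    party sees the other's set.
    """
    return len(set(a) & set(b))

def jaccard_from_cardinalities(size_a, size_b, size_intersection):
    """J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    size_union = size_a + size_b - size_intersection
    return size_intersection / size_union if size_union else 0.0

# Toy tri-gram sets: Alice holds one set, Bob holds a collection.
alice_set = {"the", "heq", "equ", "qui"}
bob_sets = [{"the", "heq", "dog"}, {"fox", "oxj"}]

similarities = [
    jaccard_from_cardinalities(len(alice_set), len(b), psi_ca_placeholder(alice_set, b))
    for b in bob_sets
]
print(similarities)
```

The point of the decomposition is that the output depends only on three integers per document pair, which is exactly what PSI-CA and the set sizes provide.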

Complexity. The complexity of the protocol in Figure 5 is bounded by that of the underlying PSI-CA construction. Using 
the technique in [16], computational complexity amounts to $O(|A| + |B_i|)$ modular exponentiations, whereas 
communication overhead amounts to $O(|A| + |B_i|)$. Observe that, in the setting where Alice holds one document and Bob a 
collection of n documents, complexities should be amended to $O(n|A| + \sum_{i=1}^{n} |B_i|)$. However, due to the nature of the 
protocol in [16], Bob can perform $O(\sum_{i=1}^{n} |B_i|)$ computation off-line, ahead of time. Hence, total online computation 
amounts to $O(n|A|)$. 

Alice ($D_A$)                                              Bob ($D_{B:i}$) 

$A \leftarrow$ Tri-Gram($D_A$)                                      $B_i \leftarrow$ Tri-Gram($D_{B:i}$) 
$h_k(A) \leftarrow$ MinHash($A$)                                    $h_k(B_i) \leftarrow$ MinHash($B_i$) 

                    $|h_k(A) \cap h_k(B_i)| \leftarrow$ PSI-CA($h_k(A)$, $h_k(B_i)$) 

Output Similarity Approximation as: $sim(A, B_i) = \dfrac{|h_k(A) \cap h_k(B_i)|}{k}$ 

Figure 6: Privacy-preserving approximation of document similarity of documents $D_A$ and $D_{B:i}$. 

More efficient computation using MinHash. As discussed in Section 2.1, one can approximate the Jaccard index 
by using MinHash techniques, thus trading off accuracy for a significant improvement in protocol complexity. The 
resulting construction is similar to the one presented above and is illustrated in Figure 6. It adds an intermediate step 
between the tri-gram representation and the execution of PSI-CA: Alice and Bob apply MinHash to sets $A$ and $B_i$, 
respectively, and obtain $h_k(A)$ and $h_k(B_i)$. The main advantage results from the fact that PSI-CA is now executed on 
smaller sets, of constant size k, thus achieving significantly improved communication and computational complexities. 
Again, note that the error is bounded by $O(1/\sqrt{k})$. 
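
To make the trade-off concrete, the following sketch estimates the Jaccard index with a multi-hash MinHash. It is an illustration under our own assumptions: seeded SHA-256 stands in for the random permutations, each minimum is tagged with the index of its hash function, and the comparison is done in the clear, whereas the protocol would feed the two signatures to PSI-CA.

```python
import hashlib

def _h(seed, item):
    """Seeded hash to a 64-bit integer; SHA-256 stands in for a random permutation."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, k):
    """For each of k seeded hash functions, keep the item with the smallest hash,
    tagged with the index of the hash function. `items` must be non-empty."""
    return {(i, min(items, key=lambda x: _h(i, x))) for i in range(k)}

def approx_jaccard(a, b, k):
    """Fraction of the k hash functions on which both sets share the same minimum."""
    return len(minhash_signature(a, k) & minhash_signature(b, k)) / k

a = {str(i) for i in range(0, 300)}
b = {str(i) for i in range(100, 400)}   # true Jaccard index: 200 / 400 = 0.5
print(approx_jaccard(a, b, k=100))
```

Both signatures have exactly k elements regardless of the sizes of the input sets, which is why the cost of the subsequent PSI-CA run depends only on k.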

Performance Evaluation. We now compare the performance of our constructions to the most efficient prior technique, 
i.e., the protocol in [28] (that, unfortunately, is insecure). We consider the setting of [28], where Bob maintains 
a collection of n documents. Recall that our constructions use the PSI-CA in [16]. Assuming 80-bit security 
parameters, we select 1024-bit moduli and 160-bit random exponents. As [28] relies on Paillier encryption, 
it uses 2048-bit moduli and 1024-bit exponents. In the following, let m denote a multiplication of 1024-bit numbers. 
Multiplications of 2048-bit numbers count for 4m. Modular exponentiations with |q|-bit exponents modulo 
1024-bit numbers count for (1.5|q|)m. The protocol in [28] requires $O(|S|)$ Paillier encryptions and $O(n \cdot |S|)$ modular 
multiplications. As pointed out above, we need $|S| = 36^3 = 46{,}656$. Therefore, the total complexity amounts to 
$(4 \cdot 1.5 \cdot 1024 + 4n)|S|m = (6144 + 4n)|S|m \approx (2.9 \cdot 10^8 + 1.9 \cdot 10^5 n)m$. 

Our construction above requires $(2 \cdot 1.5 \cdot 160 \cdot n|A|)m$ for the computation of Jaccard index similarity and 
$(1.5 \cdot 160 \cdot nk)m$ for its approximation. Thus, to compare the performance of our protocol to that of [28], we need to take into 
account the dimensions of $A$, $B_i$, as well as n and k. To this end, we collected 393 scientific papers from the KDDcup 
dataset of scientific papers published in ArXiv between 1996 and 2003 [14]. The average number of different tri-grams 
appearing in each paper is 1307. Therefore, the cost of our two techniques can be estimated as $(2 \cdot 1.5 \cdot 160 \cdot 1307n)m$ 
and $(1.5 \cdot 160 \cdot nk)m$, respectively. Thus, our technique for privacy-preserving document similarity is faster than [28] 
for n < 2000. Furthermore, using MinHash techniques, our protocol is always faster (by at least one order of 
magnitude), using both k = 40 and k = 100. Also, recall that, as opposed to ours, the protocol in [28] is not secure. 

Assuming that it takes about 1 µs to perform a modular multiplication of 1024-bit integers (as per our experiments 
on a single Intel Xeon E5420 core running at 2.50GHz), we report estimated running times in Table 1 for increasing 
values of n (i.e., the number of Bob's documents). 
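
As a worked example of how a multiplication count turns into a running time, the sketch below evaluates the cost expression stated above for the exact protocol of Figure 5, assuming 1 µs per 1024-bit multiplication. The constant and function names are ours, and only the exact protocol is modeled.

```python
MULT_TIME_S = 1e-6      # assumed: 1 microsecond per 1024-bit modular multiplication
EXP_BITS = 160          # length of the random exponents
AVG_TRIGRAMS = 1307     # average number of distinct tri-grams per paper

def exact_protocol_mults(n, set_size=AVG_TRIGRAMS):
    """(2 * 1.5 * 160 * n * |A|) multiplications, as in the cost expression in the text."""
    return 2 * 1.5 * EXP_BITS * n * set_size

def seconds(mults):
    """Convert a multiplication count into seconds under the assumed unit cost."""
    return mults * MULT_TIME_S

for n in (10, 100, 1000):
    print(n, round(seconds(exact_protocol_mults(n)), 1), "s")
```

For n = 10 this gives about 6.3 seconds, and the cost grows linearly in n.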

| n      | [28]      | Figure 5   | Figure 6 (k = 100) | Figure 6 (k = 40) |
|--------|-----------|------------|--------------------|-------------------|
| 10     | 9.5 mins  | 6.3 secs   | 0.05 secs          | 0.05 secs         |
| 10^2   | 9.9 mins  | 63 secs    | 1.9 secs           | 1.9 secs          |
| 10^3   | 12.7 mins | 10.4 mins  | 48 secs            | 19.2 secs         |
| 10^4   | 40.7 mins | 1.74 hours | 8 mins             | 3.2 mins          |
| 10^5   | 5.3 hours | 17.4 hours | 1.2 hours          | 32 mins           |

Table 1: Computation time of privacy-preserving document similarity. 

We performed some statistical analysis to determine the real magnitude of the error introduced by MinHash, when 
compared to the Jaccard index without MinHash. Our analysis is based on the tri-grams from documents in the KDDcup 
dataset [14], and confirms that the average error is within the expected bounds: for k = 40, we obtained an average 
error of 14%, while for k = 100 the average error was 9%. This is acceptable, considering that the Jaccard index 
actually provides a normalized estimate of the similarity between two sets, not a definite metric. 

5 Privacy-Preserving Iris Matching 

Advances in biometric recognition enable the use of biometric data as a practical means of authentication and 
identification. Today, several governmental agencies around the world perform large-scale collections of different 
biometric features. As an example, the US Department of Homeland Security (DHS) collects face, fingerprint and 
iris images from visitors within its US-VISIT program [52]. Iris images are also collected from all foreigners by 
the United Arab Emirates (UAE) Ministry of Interior, as well as fingerprints and photographs from certain types of 
travelers [23]. 

While biometry serves as an excellent mechanism for identification of individuals, biometric data is, undeniably, 
extremely sensitive and must be subject to minimal exposure. As a result, such data cannot be disclosed arbitrarily. 
Nonetheless, there are many legitimate scenarios where biometric data should be shared, in a controlled way, between 
different entities. For instance, an agency may need to determine whether a given biometric appears on a government 
watch-list. As agencies may have different clearance levels, the privacy of the biometric's owner should be preserved if no 
matches are found, but, at the same time, unrestricted access to the watch-list cannot be granted. 

5.1 Related Work 

As biometric identification techniques are increasingly employed, related privacy concerns have been investigated 
by the research community. A number of recent results address the problem of privacy-preserving face recognition. 
The work in [18] is the first to present a secure protocol, based on Eigenfaces, later improved by [45]. Next, [41] 
designs a new privacy-preserving face recognition algorithm, called SCiFI. Furthermore, the protocol in [6] realizes 
privacy-preserving fingerprint identification, using FingerCodes [25]. FingerCodes use texture information from a 
fingerprint to compare two biometrics. The algorithm is not as discriminative as traditional fingerprint matching 
techniques based on location of minutiae points, but it is adopted in [6] given its suitability for efficient privacy- 
preserving realization. Among all biometric techniques, this paper focuses on iris-based identification. The problem 
of privacy-preserving iris matching has been introduced by Blanton and Gasti in [7], who propose an approach based 
on a combination of garbled circuits [55] and homomorphic encryption. 

5.2 Our Construction 

A (human) iris can be digitized as an n-bit string $S = s_1 s_2 \cdots s_n$ with an n-bit mask $M_S = ms_1 ms_2 \cdots ms_n$. 
The mask indicates which bits of S have been read reliably. In particular, the i-th bit of S should be used for matching 
only if the i-th bit of $M_S$ is set to 1. A common value for n, which we use in our experiments, is 2048. Since, during 
an iris scan, the subject may rotate their head, a right or left shift can occur in the iris representation, depending on the 
direction of the rotation. Therefore, the distance between two irises A and B is computed as the minimum distance 
between all rotations of A and the representation of B. In practice, it is reasonable to assume that the rotation is limited 
to a shift of at most 5 positions towards left/right [7]. 

The matching between two irises, A and B, is computed via the Weighted Hamming Distance (WHD) of the 
samples. Let $M = M_A \wedge M_B$; WHD is computed as: 

$$WHD(A, B, M) = \frac{\|(A \oplus B) \wedge M\|}{\|M\|} \qquad (2)$$

where $\|\cdot\|$ denotes Hamming weight, i.e., the number of string bits set to 1. Given a threshold t, if WHD(A, B, M) < t, 
we say that irises A and B are matching. (Assuming a maximum rotation of 5 positions, the distance must be computed 
11 times.) 
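
For reference, the non-private computation that the protocol approximates can be sketched as below. Several details are our assumptions rather than part of the description above: rotations are modeled as cyclic shifts, the mask of A is rotated together with A, and a comparison with an all-zero combined mask is skipped.

```python
def hamming_weight(bits):
    """Number of bits set to 1."""
    return sum(bits)

def whd(a, b, mask_a, mask_b):
    """||(A xor B) and M|| / ||M|| with M = M_A and M_B; None if M is all zeros."""
    mask = [x & y for x, y in zip(mask_a, mask_b)]
    weight = hamming_weight(mask)
    if weight == 0:
        return None
    diff = [(x ^ y) & m for x, y, m in zip(a, b, mask)]
    return hamming_weight(diff) / weight

def rotate(bits, shift):
    """Cyclic shift by `shift` positions (positive = left)."""
    shift %= len(bits)
    return bits[shift:] + bits[:shift]

def min_whd(a, mask_a, b, mask_b, max_shift=5):
    """Minimum WHD over the 2 * max_shift + 1 rotations of A (11 for max_shift = 5)."""
    distances = []
    for s in range(-max_shift, max_shift + 1):
        d = whd(rotate(a, s), b, rotate(mask_a, s), mask_b)
        if d is not None:
            distances.append(d)
    return min(distances) if distances else None

iris = [1, 0, 1, 1, 0, 0, 1, 0]
ones = [1] * len(iris)
print(min_whd(iris, ones, rotate(iris, 2), ones))  # a rotated copy of the same iris
```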

In the following, we propose a probabilistic technique for privately estimating WHD(A, B, M), that relies 
on the construction for privacy-preserving estimation of the Jaccard index based on MinHash (introduced in Section 3.2). 
The error on the approximation is bounded by a function of the MinHash parameter k. 

Alice (A)                                                  Bob (B) 

$A$, $M_A$ (iris, mask)                                          $B$, $M_B$ (iris, mask) 
$R = (r_1, \ldots, r_k)$                                        $R = (r_1, \ldots, r_k)$ 

$C_M = Extract_R(M_A, M_A)$                                   $S_M = Extract_R(M_B, M_B)$ 
                    $c_1 \leftarrow$ PSI-CA($C_M$, $S_M$) 

$C = Extract_R(A, M_A)$                                       $S = Extract_R(B, M_B)$ 
                    $c_2 \leftarrow$ PSI-CA($C$, $S$) 

Output match as $(c_1 - c_2)/c_1 < t$ 

Figure 7: Privacy-preserving iris matching of biometrics A and B. 



The proposed protocol is illustrated in Figure 7. Given any two n-bit strings X and Y and any list of k values 
$R = (r_1, \ldots, r_k)$, with $r_i \in [1, n]$, we define: 

$Extract_R(X, Y) = \{w_{r_1}, \ldots, w_{r_k}\}$, where $w_{r_i} = \langle x_{r_i}, r_i \rangle$ if $y_{r_i} = 1$, and $w_{r_i}$ is a random element of $\{0, 1\}^{\tau}$ otherwise. 

Given $A$, $M_A$, $B$, $M_B$, Alice and Bob privately determine $WHD(A, B, M_A \wedge M_B)$: 

• Alice and Bob negotiate k random values $R = (r_1, \ldots, r_k)$, with $r_i \in [1, n]$. 

• Alice computes $C_M = Extract_R(M_A, M_A)$; Bob computes $S_M = Extract_R(M_B, M_B)$. 

• Alice and Bob engage in a PSI-CA protocol where their inputs are $C_M$ and $S_M$, and Alice learns the output $c_1$ of 
PSI-CA. 

• Alice computes $C = Extract_R(A, M_A)$; similarly, Bob computes $S = Extract_R(B, M_B)$. 

• Alice and Bob interact in a PSI-CA protocol with input $C$ and $S$, respectively; at the end of the protocol, Alice 
learns $c_2$, i.e., the cardinality of the intersection. 

• Biometric A matches B iff $(c_1 - c_2)/c_1 < t$. 
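
The steps above can be simulated in the clear to see what the two cardinalities measure. The sketch below is not private: set intersection replaces both PSI-CA runs. The names `extract` and `estimate_whd` are ours, positions are 0-based (the text uses [1, n]), and the random tokens come from Python's `secrets` module as a stand-in for random elements of {0,1}^τ.

```python
import secrets

def extract(x, y, positions):
    """Extract_R(X, Y): emit (x_r, r) for each sampled position r with y_r = 1,
    otherwise a fresh random token that matches nothing held by the other party."""
    out = set()
    for r in positions:
        if y[r] == 1:
            out.add((x[r], r))
        else:
            out.add(("rand", secrets.token_hex(16)))
    return out

def estimate_whd(a, mask_a, b, mask_b, positions):
    """Plaintext simulation of the two PSI-CA runs; returns (c1 - c2) / c1, or None if c1 = 0."""
    # c1: sampled positions where both masks are 1.
    c1 = len(extract(mask_a, mask_a, positions) & extract(mask_b, mask_b, positions))
    # c2: sampled positions where both masks are 1 and the iris bits agree.
    c2 = len(extract(a, mask_a, positions) & extract(b, mask_b, positions))
    return (c1 - c2) / c1 if c1 else None

a, b = [1, 0, 1, 0], [1, 1, 1, 0]
ones = [1, 1, 1, 1]
estimate = estimate_whd(a, ones, b, ones, positions=[0, 1, 2, 3])
print(estimate, estimate < 0.3)  # one disagreement out of four sampled bits
```

In other words, c1 counts the sampled bits that both parties read reliably, and c1 - c2 counts those among them on which the two irises differ.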

5.3 Comparison to prior work 

We now compare our technique for privacy-preserving iris matching to prior work, namely the technique in [7]. 
First, observe that the protocol in Figure 7 estimates the Weighted Hamming Distance with bounded error, whereas 
the construction in [7] yields its exact computation. However, as we discuss below, the error incurred by our technique is 
low enough to be used in practice and achieves reduced computational complexity. In fact, our probabilistic protocol 
could be used to perform a fast, preliminary test: if differences between two irises are significant, then there is no need 
for further tests. Otherwise, the two parties can engage in the protocol in [7] to obtain (in a privacy-preserving way) 
a precise result. Next, as opposed to the technique in [7], Alice also learns an estimate of the number of bits set to 1 
in the combined mask $M_A \wedge M_B$, but not their position. However, this information is not sensitive; thus, it does not 
leak any information about the iris sampled by Alice or Bob. 

Optimization. As discussed above, it is reasonable to assume that "Bob" (e.g., DHS) holds a database with a large 
number of biometric samples, whereas "Alice" (e.g., TSA) has only one or a few samples that she is searching for in 
Bob's database. To this end, we now show how the protocol in Figure 7 can be optimized for such a scenario, by 
pre-computing several expensive operations offline. 

|       |              | Protocol in Figure 7: Offline | Protocol in Figure 7: Online | Protocol in [7]: Offline | Protocol in [7]: Online  |
|-------|--------------|-------------------------------|------------------------------|--------------------------|--------------------------|
| Bob   | ± 5-bit rot. | 0.13 ms + 5.8 s/rec           | 71.5 ms/rec                  | 2.8 s + 71.55 ms/rec     | 97.2 ms + 134.28 ms/rec  |
| Bob   | no rot.      | 0.13 ms + 530 ms/rec          | 6.5 ms/rec                   | 2.6 s + 6.48 ms/rec      | 97.2 ms + 12.33 ms/rec   |
| Alice | ± 5-bit rot. | 71.63 ms                      | 71.5 ms/rec                  | 12.2 s + 3 ms/rec        | 20.34 ms/rec             |
| Alice | no rot.      | 6.63 ms                       | 6.5 ms/rec                   | 11.7 s + 0.27 ms/rec     | 1.8 ms/rec               |

Table 2: Computation overhead of our randomized iris matching technique in Figure 7 and that of [7]. Experiments 
are performed with 5-bit left/right rotation and with no rotation of the iris sample. "Rot." abbreviates "rotation" and 
"rec" abbreviates "record". 

Note that Bob can perform the offline phase of the PSI-CA protocol of [16] (see Figure 1) on all bits of his biometric 
samples: unlike the protocol in [7], this is required only once, independently of the number of interactions between 
Bob and any user. 

After negotiating with Bob the values $R = (r_1, \ldots, r_k)$, and before receiving her input, Alice pre-computes k pairs 
$(a_{0,i} = H(\langle 0, r_i \rangle)^{R_c},\ a_{1,i} = H(\langle 1, r_i \rangle)^{R_c})$. (This assumes the use of the PSI-CA in [16].) Once Alice's mask has 
been revealed, she constructs the corresponding encrypted representation by simply selecting the appropriate element 
from each pair. Similarly, she computes k triples $(\alpha_{0,i}, \alpha_{1,i}, \alpha_{\rho,i})$, where $\alpha_{0,i}$, $\alpha_{1,i}$, $\alpha_{\rho,i}$ represent 0, 1 and a random 
element in $\{0, 1\}^{\tau}$, respectively. Alice later uses such triples to represent each bit $\beta_i$ of her iris sample as $\alpha_i = \alpha_{\beta_i,i}$ 
if the corresponding bit in the mask is 1, else as $\alpha_i = \alpha_{\rho,i}$. During the online phase, Alice selects the appropriate 
pre-computed values to match the mask and the iris bits. Similarly, Bob inputs the selected bits of each record's mask 
and iris into the PSI-CA protocol. 

Performance Comparison. In Table 2, we report running times from implementations of, respectively, our protocol 
in Figure 7 and the technique in [7]. We assume that about 75% of the bits in the mask are set to 1 (as in [7]). We set the 
length of each iris and mask to 2048 bits and the database size to 320 irises, which is the number used in prior work. 
All tests are run on a single Intel Xeon E5420 core running at 2.50GHz. We set k = 25, thus obtaining an expected 
error in the order of $1/\sqrt{25}$, i.e., 20%. 

Observe that the online cost incurred by Bob with our technique is significantly lower than that of the protocol 
in [7], whereas it is higher for Alice. Nonetheless, summing up the computation overhead incurred by both Alice and 
Bob, our protocol is always faster than the one in [7] for the online computation. 

The offline cost imposed on Bob is about twice as high as its counterpart in the protocol from [7]. However, in our 
protocol, the offline part is done once, for all possible interactions, independently of their number, whereas, in [7], 
the offline computation needs to be performed anew for every interaction. In settings where Bob interacts frequently 
with multiple entities, this may significantly affect the protocol's overall efficiency. Furthermore, the offline cost imposed 
on Alice is markedly lower (by several orders of magnitude) using our technique. 

We conclude that the protocol in Figure 7 improves, in many settings, overall efficiency compared to the state of the 
art. However, it introduces a maximum error of about 20%, whereas the scheme in [7] computes the exact, rather 
than approximate, outcome of an iris comparison. Thus, a good practice is to use the scheme in Figure 7 to perform 
an initial selection of relevant biometric samples, using a threshold t' > t, in order to compensate for the error. The 
final matching on selected samples can then be done, in a privacy-preserving manner, using the protocol in [7]. 

6 Privacy-Preserving Multimedia File Similarity 

Amid widespread availability of digital cameras, digital audio recorders, and media-enabled smartphones, users 
generate a staggering amount of multimedia content. As a result, secure online storage (and management) of large 
volumes of multimedia data becomes increasingly desirable. According to [56], YouTube received more than 13 
million hours of video in 2010, and 48 hours are uploaded every minute (i.e., 8 years of content each day). On a 
similar note, Flickr users upload about 60 photos every second. 

Such an enormous amount of multimedia data requires automated analysis tools, e.g., Content-Based Image 
Retrieval (CBIR) [48], that is used in the context of images. There are several available techniques to implement CBIR, 
including search techniques based on color histograms [49], bin similarity coefficients [39], texture for image 
characterization [36], shape features [46], edge directions [26], and matching of shape components such as corners, line 
segments or circular arcs [13]. 

|       |             | Offline                    | Online          |
|-------|-------------|----------------------------|-----------------|
| Bob   | Exact       | 0.13 ms + 33.28 ms/record  | 33.28 ms/record |
| Bob   | Approximate | 0.13 ms + 13 ms/record     | 13 ms/record    |
| Alice | Exact       | 66.69 ms                   | 33.28 ms/record |
| Alice | Approximate | 26.13 ms                   | 13 ms/record    |

Table 3: Computation cost of our multimedia documents similarity protocol. 

Related Work. Motivated by the potential sensitivity of multimedia data, the research community has begun to 
develop mechanisms for secure signal processing. For instance, authors in [19] are the first to investigate secure 
signal processing related to multimedia documents. Then, the work in [34, 35] introduces two protocols to search 
over encrypted multimedia databases. Specifically, it extracts 256 visual features from each image. Then, files 
are encrypted in a distance-preserving fashion, so that encrypted features can be directly compared for similarity 
evaluation. Similarity is computed using the Jaccard index between the visual features of searched image and those 
of images in a database. However, the security of the scheme relies on order-preserving encryption (used to mask 
frequencies of recurring visual features), which is known to provide only a limited level of security [8]. 

Our Approach. We use the Jaccard index to assess the similarity of multimedia files. As shown in Section 3, we 
can do so, in a privacy-preserving way, using the protocol in Figure 2. Observe that our approach is independent of the 
underlying feature extraction algorithm, even though protocol accuracy naturally relies on the quality of the feature 
extraction phase. Once features have been extracted, our privacy-preserving protocols only reveal their similarity, 
without disclosing the features themselves. As an example, we instantiate our techniques for privacy-preserving 
image similarity. Our approach for feature extraction is based on [35], since its accuracy is reasonable enough for 
real-world use, using color histograms in the color space of Hue, Saturation and Value (HSV). Thus, our scheme achieves 
the same accuracy as [35], in terms of precision and recall. Once again, to obtain improved efficiency, similarity can 
be approximated using MinHash techniques, as per the protocol in Figure 3. In this case, both performance and accuracy 
depend on the MinHash parameter k. 

Performance Evaluation. We test our technique with the same dataset used by [35], i.e., 1000 images from the 
standard Corel dataset. We extract 256 features from each image, for a total of 256,000 features for the whole database. 
We envision a user, Alice, willing to assess similarity of an image against an image database, held by Bob. We run 
our protocol for privately computing the Jaccard index ("Exact" rows in Table 3) and for estimated similarity, using 
MinHash with k = 100 ("Approximate" row). Table 3 summarizes our experiments. All tests are run on a single 
Intel Xeon E5420 core running at 2.50GHz and show that privacy protection is attainable at a very limited cost. 

Note that a thorough performance comparison between our protocol and related work is out of the scope of this 
paper, since the main effort of prior work has been achieving high accuracy in similarity detection, rather than improving 
efficiency. Thus, we defer it to future work. Nonetheless, the authors of [35] report that the running time of their 
protocol is in the order of 1 second per image, on hardware comparable to our testbed (a dual-core 3GHz PC with 
4GB of RAM). Therefore, it is safe to assume that our protocol for privacy-preserving multimedia file similarity is 
about one order of magnitude faster than available techniques, even without considering pre-computation. 

7 Faster and Size-Hiding (Approximated) PSI-CA 

Privacy-preserving computation of set intersection cardinality has been investigated quite extensively by the 
research community [21, 31, 53, 16], motivated by several interesting applications, including: privacy-preserving 
authentication and key exchange protocols [2], data and association rule mining [53], genomic applications [5], 
healthcare [31], policy-based information sharing [16], anonymous routing [57], and, as argued by this paper, sample set 
similarity. 

In many of the application scenarios, however, it may be enough to obtain an estimation, rather than the exact 
measure, of set intersection cardinality. For instance, if PSI-CA is used to privately quantify the number of common 
social-network friends (e.g., to assess profile similarity) [32], then one may want to trade off a bounded accuracy loss 
for a significant improvement in protocol overhead (and without sacrificing the level of attained privacy protection). 
Naturally, such an improved construction is particularly appealing whenever participants' input sets are very large. 

Using MinHash techniques, we realize privacy-preserving estimation of set intersection cardinality with (constant) 
computation and communication complexities that only depend on the MinHash parameter, i.e., $O(k)$. The proposed 
construction is illustrated in Figure 8. Observe that, while we tolerate a bounded accuracy loss (again, depending on 
MinHash's parameter k, i.e., $O(1/\sqrt{k})$), our protocol achieves the same provably-secure privacy guarantees as if we 
ran PSI-CA on whole sets. 



Privacy-preserving approximation of $|A \cap B|$ 

Run by Alice and Bob on input, resp., A and B (with $v = |A|$ and $w = |B|$) 

1. Alice and Bob compute $\{(a_i, i)\}_{i=1}^{k}$ and $\{(b_i, i)\}_{i=1}^{k}$, resp., using multi-hash 
MinHash, where $a_i = h^{(i)}_{\min}(A)$ and $b_i = h^{(i)}_{\min}(B)$ 

2. Alice and Bob execute PSI-CA on input, resp., $(\{(a_i, i)\}_{i=1}^{k}, k)$ and $(\{(b_i, i)\}_{i=1}^{k}, k)$ 

3. Alice learns $\delta = |\{(a_i, i)\}_{i=1}^{k} \cap \{(b_i, i)\}_{i=1}^{k}|$ 

4. Bob sends w to Alice 

5. Alice outputs $\delta \cdot (v + w)/(k + \delta)$ 

Figure 8: Our technique for Approximated Private Set Intersection Cardinality. 
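
One way to see where the final expression comes from is to invert the definition of the Jaccard index. The derivation below assumes that v = |A| and w = |B|, and that the fraction δ/k of matching MinHash values is used as the estimate of J(A, B):

```latex
\begin{align*}
J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
  \quad &\Longrightarrow \quad
  |A \cap B| = \frac{J(A, B)\,(v + w)}{1 + J(A, B)}, \\
J(A, B) \approx \frac{\delta}{k}
  \quad &\Longrightarrow \quad
  |A \cap B| \approx \frac{(\delta / k)\,(v + w)}{1 + \delta / k}
  = \frac{\delta\,(v + w)}{k + \delta}.
\end{align*}
```

As a sanity check, identical sets give δ = k and v = w, so the estimate is k · 2v / 2k = v, as expected.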

Size-Hiding. Another factor motivating the use of MinHash techniques for PSI-CA is related to input size secrecy. 
Available PSI-CA protocols always disclose, from the execution, at least an upper bound on input set sizes. Whereas, 
protocol in Figure 8 conceals — unconditionally — Alice's set size, thus, achieving Size-Hiding Private Set Intersection 
Cardinality. Considering recent results motivating the need for size-hiding features in private set operations (see [3]), 
this additional feature is particularly valuable. 

8 Conclusion 

This paper introduced the first efficient construction for privacy-preserving evaluation of sample set similarity, 
relying on the Jaccard index measure. We also presented an efficient randomized protocol that approximates, with 
bounded error, this similarity index. Our techniques are generic and practical enough to be used as a basic building 
block for a wide array of different privacy-preserving functionalities, including document and multimedia file 
similarity, biometric matching, genomic testing, similarity of social profiles, and so on. Experimental analyses support our 
efficiency claims and demonstrate improvements over prior results. 

Naturally, our work does not end here: additional applications and extensions require further investigation. Also, 
as part of future work, we plan to investigate privacy-preserving computation of other similarity measures. 

Acknowledgments. We gratefully acknowledge Elaine Shi for providing us with the idea of genetic paternity testing 
as one of our motivating examples. 

A Additional Details on MinHash Techniques 

Single-Hash MinHashes. Besides the multiple-hash technique presented in Section 2, another approach for 
approximating the Jaccard index using MinHash employs a single hash function. In this case, rather than selecting one value 
per hash function, one selects the k values from set A that hash to the smallest integers. Specifically, let h(·) be a hash 
function, and k a fixed parameter; for any set S, define $h_k(S) \subset S$ as the set of the pre-images of the k smallest hash 
values of elements of S. Consider: 

$$sim(A, B) = \frac{|h_k(h_k(A) \cup h_k(B)) \cap h_k(A) \cap h_k(B)|}{|h_k(h_k(A) \cup h_k(B))|} \qquad (3)$$

It holds that sim(A, B) is an unbiased estimate of the Jaccard index of A and B. Again, by standard Chernoff 
bounds [12], the expected error is $O(1/\sqrt{k})$. 
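
Equation (3) translates almost literally into code. The sketch below is an illustration only: SHA-256 truncated to 64 bits is our stand-in for h(·), and the estimate is computed in the clear.

```python
import hashlib

def h(item):
    """Hash an item to a 64-bit integer (SHA-256 is an assumption made for this sketch)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def h_k(items, k):
    """The (at most) k elements of `items` with the smallest hash values."""
    return set(sorted(items, key=h)[:k])

def sim(a, b, k):
    """Equation (3): among the k smallest elements of the merged sketches,
    the fraction that belongs to both sketches. At least one set must be non-empty."""
    sketch_a, sketch_b = h_k(a, k), h_k(b, k)
    merged = h_k(sketch_a | sketch_b, k)
    return len(merged & sketch_a & sketch_b) / len(merged)

a = set(range(0, 300))
b = set(range(100, 400))      # true Jaccard index: 200 / 400 = 0.5
print(sim(a, b, k=100))
```

Compared to the multiple-hash variant, each element is hashed once rather than k times, at the price of sketches that must be merged before they can be compared.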

MinHash Instantiations. In order to implement the MinHash schemes described in this paper, the hash function 
should be defined by a random permutation over the set $A \cup B$. Assuming $m = |A \cup B|$, one would need 
$\Omega(m \log m)$ bits to specify a truly random permutation, thus yielding an infeasible overhead even for small values 
of m. Broder et al. [10] point out that one can obviate this problem by using Min-wise Independent Permutation 
(MWIP) families rather than random permutations. Using MWIPs, for any subset of the domain, any element is equally 
likely to be the minimum, but the number of bits needed to specify such a permutation is shown to still be relatively large, 
i.e., $\Omega(m)$. In practice, however, one can allow certain relaxations. To this end, [10] introduces approximate MWIPs, 
by accepting a small error $\epsilon$. The authors only require all items in a set S to have an almost equal chance of becoming 
the minimum element of S's image under the permutation. Thus, for any approximate MWIP, implemented using 
$h_{\min}(\cdot)$ as defined above, for an expected relative error $\epsilon$, it holds that, for any set S and any $x \in S$: 

$$\left| \Pr[h_{\min}(S) = h(x)] - \frac{1}{|S|} \right| \le \frac{\epsilon}{|S|}$$

A class of permutations often used in practice is one based on linear transformations. It is assumed that the universe 
is $\mathbb{Z}_p$ for some prime p and the family of permutations is constructed using a hash function computed as $h(x) = ax + b 
\bmod p$, where $(a, b) \in \mathbb{Z}_p^* \times \mathbb{Z}_p$. Such linear transformations are easy to represent and efficiently computable. 
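
A linear permutation of this kind takes only a few lines. In the sketch below, the function names and the toy parameters (p = 101, a = 7, b = 13) are ours and chosen purely for illustration; a real instantiation would use a prime at least as large as the universe of elements and randomly chosen a and b.

```python
def make_linear_hash(a, b, p):
    """h(x) = (a * x + b) mod p with a in Z_p^* and b in Z_p.
    For prime p this is a permutation of Z_p."""
    if not (1 <= a < p and 0 <= b < p):
        raise ValueError("need 1 <= a < p and 0 <= b < p")
    return lambda x: (a * x + b) % p

def h_min(items, h):
    """Element of `items` whose image under h is smallest."""
    return min(items, key=h)

p = 101                                   # small prime, for illustration only
h = make_linear_hash(a=7, b=13, p=p)

# h maps Z_p onto itself without collisions.
assert sorted(h(x) for x in range(p)) == list(range(p))

print(h_min({1, 13, 14}, h))              # h(1) = 20, h(13) = 3, h(14) = 10
```

Specifying such a permutation requires only the pair (a, b), i.e., about 2 log p bits.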



B Flaw in Private Document Similarity in [28] 

In this section, we show that the protocol in [28] is not privacy-preserving (even in the semi-honest model). In fact, 
Bob, in order to participate in the protocol, must disclose his global space of tri-grams. Given this information, Alice 
can efficiently check whether a word, e.g., w, appears in Bob's document collection. Indeed, Alice computes w's 
tri-gram based representation, then she checks whether all such tri-grams appear in Bob's public global space. If so, 
Alice learns that w appears in a document held by Bob with some non-zero probability. Technically, this probability 
is not 1 because Alice could have a false positive, i.e., w may not be in Bob's documents even though w's tri-grams 
are in Bob's public global space. On the other hand, if at least one of the tri-grams of w is not in Bob's public global 
space, Alice learns that Bob's documents do not contain w. This, obviously, violates privacy requirements. If Alice 
and Bob include punctuation and spaces in the tri-gram representation of their documents, the probability of a false 
positive becomes negligible. We do not exploit "relations" between consecutive meaningful words in the sentence, 
which could potentially (further) aggravate information leakage about Bob's documents. 
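
The membership test described above amounts to a subset check. The following sketch uses our own function names and a toy global space built from a single short string; it assumes, as in the text, that documents are stripped of spaces and punctuation.

```python
def tri_grams(text):
    """Set of tri-grams of a string that has already been stripped of spaces and punctuation."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def may_contain(word, global_space):
    """True if every tri-gram of `word` is in the disclosed global space.

    False means the word certainly does not occur as a contiguous string;
    True may be a false positive. Words shorter than three letters have no
    tri-grams and are trivially reported as possible.
    """
    return tri_grams(word) <= global_space

# Toy example: Bob's collection consists of one short document.
bob_global_space = tri_grams("thequickbrownfox")

print(may_contain("quick", bob_global_space))   # True: "qui", "uic", "ick" are all present
print(may_contain("dog", bob_global_space))     # False: "dog" is not in the global space
```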

We now show yet another attack that lets Alice learn even more, since the N-gram representation embeds the 
document's structure. From the global space of tri-grams GS, we can construct a directed graph G(V, E) representing 
relations between tri-grams in Bob's document collection. Any path in such a graph will lead to a textual fragment 
contained in some document held by Bob. A vertex in the graph represents a tri-gram, whereas an edge between two 
vertices implies that the two corresponding tri-grams are consecutive tri-grams in a word. Given a tri-gram $x \in GS$, 
with $x^{(i)}$ we denote the i-th letter in x. The directed graph G(V, E) is constructed as follows. The vertex set is 
$V = \{V_x \mid x \in GS\}$ and the edge set is $E = \{(V_x, V_y) \mid x^{(2)} = y^{(1)} \wedge x^{(3)} = y^{(2)}\}$. A path $V_{x_1}, \ldots, V_{x_n}$ in G will 
correspond to the string $x_1^{(1)} x_1^{(2)} x_1^{(3)} x_2^{(3)} \cdots x_n^{(3)}$. Such a string (or some of its substrings) appears in some document 
in Bob's collection. By using algorithms based on a Depth-First Search visit of the graph, a vocabulary, and syntactic rules, 
we could extract large chunks of the documents. We did not further explore other techniques to extract "information" from 
the global space of tri-grams, as we consider them to be out of the scope of this paper. 
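
The graph construction and the path-to-string mapping can be sketched as follows. The function names, the toy global space, and the restriction to simple paths of bounded length are our own choices for illustration; no vocabulary or syntactic rules are applied here.

```python
def build_graph(global_space):
    """Edge x -> y iff the last two letters of x equal the first two letters of y."""
    return {x: sorted(y for y in global_space if x[1:] == y[:2]) for x in global_space}

def path_to_string(path):
    """Overlap consecutive tri-grams: the first tri-gram plus the last letter of each later one."""
    return path[0] + "".join(x[2] for x in path[1:])

def walk(graph, start, max_len):
    """Depth-first enumeration of simple paths from `start` with at most `max_len` vertices."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        yield path
        if len(path) < max_len:
            for nxt in graph[path[-1]]:
                if nxt not in path:
                    stack.append(path + [nxt])

# Toy global space: the tri-grams of the fragment "lazydog".
space = {"laz", "azy", "zyd", "ydo", "dog"}
graph = build_graph(space)
fragments = {path_to_string(p) for p in walk(graph, "laz", max_len=5)}
print(sorted(fragments, key=len)[-1])   # longest reconstructed fragment
```

On realistic global spaces the graph has many branches, which is where a vocabulary and syntactic rules would be needed to prune implausible paths.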